[IR] Add llvm.masked.load.first.fault intrinsic (#156470)
huntergr-arm wants to merge 7 commits into llvm:main
Conversation
david-arm left a comment:
It might be a good idea to split this patch up into two parts:
- A patch to introduce the LangRef and intrinsic along with the scalarisation code (on by default for all targets).
- A follow-on patch to add the TTI hook and the codegen.
What do you think?
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s

define { <4 x i32>, <4 x i1> } @speculative_load_v4i32_neon(ptr %p, <4 x i1> %mask) {
It might be worth having a separate test file for fixed-width perhaps? For fixed-width I think we want to have different RUN lines for normal SVE and streaming mode. For the latter we can't use NEON and so will have to fall back on SVE.
Done. I've only included fixed vector tests for now as we won't scalarize scalable vectors.
Force-pushed from edd26c4 to b76e940.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes: In order to support loading from addresses which may not be valid at runtime without generating faults, we introduce a speculative load intrinsic. This PR contains both target-independent and AArch64-specific codegen.

Full diff: https://github.com/llvm/llvm-project/pull/156470.diff

4 Files Affected:
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 1c6823be44dcb..92ea6554c4aba 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -27098,6 +27098,65 @@ The '``llvm.masked.compressstore``' intrinsic is designed for compressing data i
Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.
+.. _int_mspecload:
+
+'``llvm.masked.speculative.load.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic. The loaded data is a vector of any integer, floating-point or pointer data type.
+
+::
+
+ declare { <16 x float>, <16 x i1> } @llvm.masked.speculative.load.v16f32.p0(ptr <ptr>, i32 <alignment>, <16 x i1> <mask>)
+ declare { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr <ptr>, i32 <alignment>, <2 x i1> <mask>)
+ ;; The data is a vector of pointers
+ declare { <8 x ptr>, <8 x i1> } @llvm.masked.speculative.load.v8p0.p0(ptr <ptr>, i32 <alignment>, <8 x i1> <mask>)
+
+Overview:
+"""""""""
+
+Reads a vector from memory according to the provided mask, suppressing faults
+for any lane beyond the first. The mask holds a bit for each vector lane, and
+is used to prevent memory accesses to the masked-off lanes. Inactive lanes will
+be zero in the result vector.
+
+Returns the loaded data and a mask indicating which lanes are valid, which may
+not be the same as the input mask depending on whether the processor encountered
+a reason to avoid loading from that address.
+
+Arguments:
+""""""""""
+
+The first argument is the base pointer for the load. The second argument is the
+alignment of the source location. It must be a power of two constant integer
+value. The third argument, mask, is a vector of boolean values with the same
+number of elements as the return type.
+
+Semantics:
+""""""""""
+
+The '``llvm.masked.speculative.load``' intrinsic is similar to the
+'``llvm.masked.load``' intrinsic, in that it conditionally loads values from
+memory into a vector based on a mask. However, it allows loading from addresses
+which may not be entirely safe. If the memory corresponding to the first element
+of the vector is inaccessible, then a fault will be raised as normal. For all
+subsequent lanes faults will be suppressed and the corresponding bit in the
+output mask will be marked inactive. The remaining elements in the output mask
+after a suppressed fault will also be marked inactive. Elements with active bits
+in the input mask will be poison values if the corresponding bit is inactive in
+the output mask.
+
+Reasons for marking output elements inactive are processor dependent; it may be
+a genuine fault, e.g. if the range of the data being loaded spans a page
+boundary and the page at the higher address is not mapped. But a given
+processor may also mark elements as inactive for other reasons, such as a cache
+miss. Code using this intrinsic must take this into account and not assume that
+inactive lanes signal the end of accessible memory. If more data should be
+loaded based on the semantics of the user code, then the base pointer should be
+advanced to the address of the first inactive element and a new speculative load
+attempted.
Memory Use Markers
------------------
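The contract described in the semantics section above — a real fault only for the first element, with suppressed faults turning into an inactive tail of the output mask — can be modeled in plain C++. This is a sketch of the semantics only, not LLVM code; the `accessible` parameter (how many elements can be read before a fault) and all names are illustrative:

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Result of a modeled first-fault load: the data vector and the mask of
// lanes that were actually loaded. Lanes dropped by a suppressed fault are
// poison in the real IR; here they keep their zero-initialized value.
template <typename T, std::size_t N>
struct FFResult {
  std::array<T, N> data{};
  std::array<bool, N> out_mask{};
};

template <typename T, std::size_t N>
FFResult<T, N> first_fault_load(const T *p, std::size_t accessible,
                                const std::array<bool, N> &mask) {
  FFResult<T, N> r;
  for (std::size_t i = 0; i < N; ++i) {
    if (!mask[i])
      continue; // masked-off lane: zero data, inactive in the output mask
    if (i >= accessible) {
      if (i == 0)
        throw std::out_of_range("fault on first lane"); // real fault
      break; // suppressed fault: this and all later lanes are inactive
    }
    r.data[i] = p[i];
    r.out_mask[i] = true;
  }
  return r;
}
```

Note that, per the LangRef wording, a processor may also drop lanes for reasons other than a genuine fault (e.g. a cache miss); the model above only shows the fault case.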
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 4d59ee8676b9e..74fed62fcc1dd 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2505,6 +2505,14 @@ def int_masked_compressstore:
[IntrWriteMem, IntrArgMemOnly,
NoCapture<ArgIndex<1>>]>;
+def int_masked_speculative_load
+ : DefaultAttrsIntrinsic<[llvm_anyvector_ty,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [llvm_anyptr_ty, llvm_i32_ty,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [IntrReadMem, IntrArgMemOnly, ImmArg<ArgIndex<1>>,
+ NoCapture<ArgIndex<0>>]>;
+
def int_experimental_vector_compress:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
[LLVMMatchType<0>, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
diff --git a/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp b/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
index 146e7d1047dd0..531d79f89f7cd 100644
--- a/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
+++ b/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
@@ -1031,6 +1031,68 @@ static void scalarizeMaskedVectorHistogram(const DataLayout &DL, CallInst *CI,
ModifiedDT = true;
}
+static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
+ DomTreeUpdater *DTU,
+ bool &ModifiedDT) {
+ // For a target without speculative/first-faulting load support, we can't
+ // actually scalarize accesses for all lanes. However, lanes beyond the
+ // first may be considered inactive due to reasons beyond a fault, so for
+ // generic 'scalarization' we can just load the first lane (if the
+ // corresponding input mask bit is active), then mark all other lanes as
+ // inactive in the output mask and embed the first lane into a vector of
+ // poison.
+ Value *Ptr = CI->getArgOperand(0);
+ Value *Align = CI->getArgOperand(1);
+ Value *Mask = CI->getArgOperand(2);
+ StructType *RetTy = cast<StructType>(CI->getType());
+ VectorType *DataTy = cast<VectorType>(RetTy->getElementType(0));
+ VectorType *MaskTy = cast<VectorType>(RetTy->getElementType(1));
+ Type *ScalarTy = DataTy->getScalarType();
+
+ MaybeAlign AlignVal = cast<ConstantInt>(Align)->getMaybeAlignValue();
+
+ IRBuilder<> Builder(CI->getContext());
+ BasicBlock *IfBlock = CI->getParent();
+ Builder.SetInsertPoint(CI);
+ Builder.SetCurrentDebugLocation(CI->getDebugLoc());
+ Value *EmptyMask = Constant::getNullValue(MaskTy);
+ Value *PoisonData = PoisonValue::get(DataTy);
+
+ // FIXME: If the mask is a constant, we can skip the extract.
+ Value *FirstActive =
+ Builder.CreateExtractElement(Mask, 0ul, Twine("first.active"));
+ Instruction *ThenTerm =
+ SplitBlockAndInsertIfThen(FirstActive, CI,
+ /*Unreachable=*/false,
+ /*BranchWeights=*/nullptr, DTU);
+
+ BasicBlock *ThenBlock = ThenTerm->getParent();
+ ThenBlock->setName("speculative.load.first.lane");
+ Builder.SetInsertPoint(ThenBlock->getTerminator());
+ LoadInst *Load = Builder.CreateAlignedLoad(ScalarTy, Ptr, AlignVal);
+ Value *OneLaneData = Builder.CreateInsertElement(PoisonData, Load, 0ul);
+ Value *OneLaneMask = Builder.CreateInsertElement(
+ EmptyMask, Constant::getAllOnesValue(MaskTy->getElementType()), 0ul);
+
+ Builder.SetInsertPoint(CI);
+ PHINode *ResData = Builder.CreatePHI(DataTy, 2);
+ ResData->addIncoming(PoisonData, IfBlock);
+ ResData->addIncoming(OneLaneData, ThenBlock);
+ PHINode *ResMask = Builder.CreatePHI(MaskTy, 2);
+ ResMask->addIncoming(EmptyMask, IfBlock);
+ ResMask->addIncoming(OneLaneMask, ThenBlock);
+
+ Value *Result = PoisonValue::get(RetTy);
+ Result = Builder.CreateInsertValue(Result, ResData, 0ul);
+ Result = Builder.CreateInsertValue(Result, ResMask, 1ul);
+ if (CI->hasName())
+ Result->setName(CI->getName() + ".first.lane.only");
+ CI->getParent()->setName("speculative.result");
+ CI->replaceAllUsesWith(Result);
+ CI->eraseFromParent();
+ ModifiedDT = true;
+}
+
static bool runImpl(Function &F, const TargetTransformInfo &TTI,
DominatorTree *DT) {
std::optional<DomTreeUpdater> DTU;
@@ -1170,8 +1232,12 @@ static bool optimizeCallInst(CallInst *CI, bool &ModifiedDT,
scalarizeMaskedCompressStore(DL, HasBranchDivergence, CI, DTU,
ModifiedDT);
return true;
+ case Intrinsic::masked_speculative_load: {
+ scalarizeMaskedSpeculativeLoad(DL, CI, DTU, ModifiedDT);
+ return true;
}
}
+ }
return false;
}
diff --git a/llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll b/llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll
new file mode 100644
index 0000000000000..e8facb8a5683b
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll
@@ -0,0 +1,79 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s --check-prefixes=NEON
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s --check-prefixes=SVE
+
+define { <4 x i32>, <4 x i1> } @speculative_load_v4i32(ptr %p, <4 x i1> %mask) {
+; NEON-LABEL: speculative_load_v4i32:
+; NEON: // %bb.0:
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: umov w8, v0.h[0]
+; NEON-NEXT: tbz w8, #0, .LBB0_2
+; NEON-NEXT: // %bb.1: // %speculative.load.first.lane
+; NEON-NEXT: adrp x8, .LCPI0_0
+; NEON-NEXT: ldr s0, [x0]
+; NEON-NEXT: ldr d1, [x8, :lo12:.LCPI0_0]
+; NEON-NEXT: // kill: def $d1 killed $d1 killed $q1
+; NEON-NEXT: ret
+; NEON-NEXT: .LBB0_2:
+; NEON-NEXT: movi v1.2d, #0000000000000000
+; NEON-NEXT: // implicit-def: $q0
+; NEON-NEXT: // kill: def $d1 killed $d1 killed $q1
+; NEON-NEXT: ret
+;
+; SVE-LABEL: speculative_load_v4i32:
+; SVE: // %bb.0:
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: umov w8, v0.h[0]
+; SVE-NEXT: tbz w8, #0, .LBB0_2
+; SVE-NEXT: // %bb.1: // %speculative.load.first.lane
+; SVE-NEXT: adrp x8, .LCPI0_0
+; SVE-NEXT: ldr s0, [x0]
+; SVE-NEXT: ldr d1, [x8, :lo12:.LCPI0_0]
+; SVE-NEXT: // kill: def $d1 killed $d1 killed $q1
+; SVE-NEXT: ret
+; SVE-NEXT: .LBB0_2:
+; SVE-NEXT: movi v1.2d, #0000000000000000
+; SVE-NEXT: // implicit-def: $q0
+; SVE-NEXT: // kill: def $d1 killed $d1 killed $q1
+; SVE-NEXT: ret
+ %res = call { <4 x i32>, <4 x i1> } @llvm.masked.speculative.load.v4i32.p0(ptr %p, i32 16, <4 x i1> %mask)
+ ret { <4 x i32>, <4 x i1> } %res
+}
+
+;; FIXME: If we know the input mask is all-true and the vector is fully aligned,
+;; we should be able to use a normal NEON load here.
+define { <2 x double>, <2 x i1> } @speculative_load_v2f64_all_true_fully_aligned(ptr %p) {
+; NEON-LABEL: speculative_load_v2f64_all_true_fully_aligned:
+; NEON: // %bb.0: // %speculative.load.first.lane
+; NEON-NEXT: adrp x8, .LCPI1_0
+; NEON-NEXT: ldr d0, [x0]
+; NEON-NEXT: ldr d1, [x8, :lo12:.LCPI1_0]
+; NEON-NEXT: ret
+;
+; SVE-LABEL: speculative_load_v2f64_all_true_fully_aligned:
+; SVE: // %bb.0: // %speculative.load.first.lane
+; SVE-NEXT: ldr d0, [x0]
+; SVE-NEXT: index z1.s, #1, #-1
+; SVE-NEXT: // kill: def $d1 killed $d1 killed $z1
+; SVE-NEXT: ret
+ %res = call { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr %p, i32 16, <2 x i1> <i1 true, i1 true>)
+ ret { <2 x double>, <2 x i1> } %res
+}
+
+define { <2 x double>, <2 x i1> } @speculative_load_v2f64_all_true_partially_aligned(ptr %p) {
+; NEON-LABEL: speculative_load_v2f64_all_true_partially_aligned:
+; NEON: // %bb.0: // %speculative.load.first.lane
+; NEON-NEXT: adrp x8, .LCPI2_0
+; NEON-NEXT: ldr d0, [x0]
+; NEON-NEXT: ldr d1, [x8, :lo12:.LCPI2_0]
+; NEON-NEXT: ret
+;
+; SVE-LABEL: speculative_load_v2f64_all_true_partially_aligned:
+; SVE: // %bb.0: // %speculative.load.first.lane
+; SVE-NEXT: ldr d0, [x0]
+; SVE-NEXT: index z1.s, #1, #-1
+; SVE-NEXT: // kill: def $d1 killed $d1 killed $z1
+; SVE-NEXT: ret
+ %res = call { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr %p, i32 8, <2 x i1> <i1 true, i1 true>)
+ ret { <2 x double>, <2 x i1> } %res
+}
Yep, that's fine. I'll use this PR for the intrinsic and raise the AArch64 PR after this lands.
static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
                                           DomTreeUpdater *DTU,
                                           bool &ModifiedDT) {
  // For a target without speculative/first-faulting load support, we can't
I think it'd be good to have an IR test with -passes=scalarize-masked-mem-intrin, similar to llvm/test/Transforms/ScalarizeMaskedMemIntrin/AArch64/streaming-compatible-expand-masked-gather-scatter.ll.
  ModifiedDT = true;
}

static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
Other scalarize* functions have a doc comment with an example, I think that'd be worth adding here too.
llvm/docs/LangRef.rst (outdated)

Syntax:
"""""""
This is an overloaded intrinsic. The loaded data is a vector of any integer, floating-point or pointer data type.
Value *EmptyMask = Constant::getNullValue(MaskTy);
Value *PoisonData = PoisonValue::get(DataTy);

// FIXME: If the mask is a constant, we can skip the extract.
Is it worth handling this here? -passes=instcombine,simplifycfg would do the same.
llvm/docs/LangRef.rst (outdated)

of the vector is inaccessible, then a fault will be raised as normal. For all
subsequent lanes faults will be suppressed and the corresponding bit in the
nit, suggested change:

- of the vector is inaccessible, then a fault will be raised as normal. For all
- subsequent lanes faults will be suppressed and the corresponding bit in the
+ of the vector is inaccessible, then a fault will be raised as normal. For all subsequent
+ lanes, if a fault occurs, it will be suppressed and the corresponding bit in the
llvm/docs/LangRef.rst (outdated)

the output mask.

Reasons for marking output elements inactive are processor dependent; it may be
a genuine fault, e.g. if the range of the data being loaded spans a page
Worth saying a reason for marking lanes inactive could also be lack of support for suppressing faults?
Force-pushed from 364c8c4 to 7a8355f.
Changed to just focus on first-faulting loads, since there's already a .vp EVL-based intrinsic for this and #179642 should cover the page-based speculation.
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed.
llvm/docs/LangRef.rst (outdated)

declare { <16 x float>, <16 x i1> } @llvm.masked.load.first.fault.v16f32.p0(ptr <ptr>, i32 <alignment>, <16 x i1> <mask>)
declare { <2 x double>, <2 x i1> } @llvm.masked.load.first.fault.v2f64.p0(ptr <ptr>, i32 <alignment>, <2 x i1> <mask>)
;; The data is a vector of pointers
declare { <8 x ptr>, <8 x i1> } @llvm.masked.load.first.fault.v8p0.p0(ptr <ptr>, i32 <alignment>, <8 x i1> <mask>)
Do we need a passthru argument similar to the generic masked loads? cc @paulwalker-arm
I'm not sure we want a passthru for other lanes, given that we will likely stop processing anything past a 'fault' when loading data to determine early-exit conditions in the current vector iteration.
The equivalent evl intrinsic (llvm.vp.load.ff) doesn't have a passthru either.
I agree, defining the other lanes as poison seems fine. If users want to set them to a known value they can use a select instruction to set them.
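The select-based approach amounts to blending the returned data with a passthru vector under the returned mask. A sketch in plain C++ (names are illustrative; in IR this would be a single `select` on the result mask):

```cpp
#include <array>
#include <cstddef>

// Emulate `select %out_mask, %data, %passthru`: lanes the load left invalid
// (poison in the real IR) are replaced by the corresponding passthru value.
template <typename T, std::size_t N>
std::array<T, N> apply_passthru(const std::array<T, N> &data,
                                const std::array<bool, N> &out_mask,
                                const std::array<T, N> &passthru) {
  std::array<T, N> r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = out_mask[i] ? data[i] : passthru[i];
  return r;
}
```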
fhahn left a comment:
IIUC the new intrinsic has the same semantics as llvm.vp.load.ff, modulo the extra EVL argument?
If so, it would be nice to unify them if possible; not sure if we could have an optional EVL argument, which defaults to the number of elements of the vector type if omitted?
If that's not possible, it would be good to try to keep the names consistent (load.ff vs load.first.fault) and possibly share the wording for the semantics definition somehow, possibly referencing vp.load.ff
Sure, I can use the `.ff` naming. It returns a mask instead of a new EVL, but that mask would be a 'prefix' mask, so it could be synthesized from the returned EVL. As far as the overlap in functionality goes, it's pretty much the same as llvm.masked.load vs. llvm.vp.load (or store, gather, scatter, etc.). Would operand bundles be the way to make an argument optional? At some point, Arm would probably also want to introduce a …
It would be possible to make it a variable-argument intrinsic, but that isn't nice, I think, unfortunately, as it would require spelling out the full function type for each call IIRC. Operand bundles may be a way, but I am not sure if they are super well supported everywhere. Maybe for the new intrinsic as a first step, it would be sufficient to share the description of the semantics, and link between them in the LangRef?
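The prefix-mask equivalence mentioned above is easy to state concretely: an EVL of k corresponds to a mask with exactly the first k bits set, so the two return conventions are interconvertible. A sketch (helper names are illustrative, not existing LLVM utilities):

```cpp
#include <cstddef>
#include <vector>

// Synthesize the prefix mask that an EVL of `evl` implies for an n-element
// vector: bits 0..evl-1 set, the rest clear.
std::vector<bool> mask_from_evl(std::size_t n, std::size_t evl) {
  std::vector<bool> mask(n, false);
  for (std::size_t i = 0; i < evl && i < n; ++i)
    mask[i] = true;
  return mask;
}

// Recover the EVL from a prefix mask by counting the leading set bits.
// Only meaningful if the mask really is a prefix mask.
std::size_t evl_from_mask(const std::vector<bool> &mask) {
  std::size_t evl = 0;
  while (evl < mask.size() && mask[evl])
    ++evl;
  return evl;
}
```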
Force-pushed from 1bc1d81 to 15eded6.
Renamed to use a `.ff` suffix.
In order to support loading from addresses which may not be valid at runtime without generating faults, we introduce the first fault load intrinsic. Loading with this intrinsic will only generate a fault for invalid accesses on the first element of the vector. Any subsequent fault will be suppressed and the corresponding data will be poison. This PR contains target-independent scalarization of the intrinsic so that generic codegen works.
Force-pushed from 15eded6 to bed61d9.
The '``llvm.masked.load.ff``' intrinsic is very similar to the
'``llvm.vp.load.ff``' intrinsic, with the differences being the lack of an EVL
parameter and the second returned value being a mask instead of an updated EVL
value.
Given that VP intrinsics are usually defined as a "layer above" masked intrinsics, should we instead modify the documentation llvm.vp.load.ff to mention it's similar to llvm.masked.load.ff instead?
If the processor suppresses a fault for any lane, then the returned mask will
indicate that lane and all subsequent lanes are inactive.
Does this mean that the mask is a prefix mask? I.e. after the first unset bit, every subsequent bit will be unset. I think we will need to depend on that property in the loop vectorizer. Does SVE guarantee that?
We tried to add expansion for vp.load.ff too in #154440 but I'm not sure if it's worth it.
The scalarization for the vp intrinsic works the same way as here where we rely on the fact that the intrinsic can arbitrarily reduce the number of non-faulted lanes to one.
But in that case I don't think it will ever be profitable to scalarize these intrinsics, because it means that a vectorized loop will always just be stepping through one element at a time. I.e. it's now a scalar loop.
I think we should just be returning an invalid cost if the target can't lower these.
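To make the profitability point concrete: a consumer loop advances by the number of valid lanes each iteration, so an implementation that only ever produces one valid lane turns the vector loop into an effectively scalar one. A plain C++ sketch (the `ff_load4` callback and all names are illustrative, not LLVM API):

```cpp
#include <array>
#include <cstddef>

// Count the valid prefix of an output mask; the consuming loop advances by
// this amount each iteration.
template <std::size_t N>
std::size_t valid_lanes(const std::array<bool, N> &out_mask) {
  std::size_t n = 0;
  while (n < N && out_mask[n])
    ++n;
  return n;
}

// Sum `len` ints using a caller-supplied 4-wide first-fault load; the loop
// advances by however many lanes the load reported as valid. If the load is
// scalarized and only ever produces one valid lane, this is a scalar loop.
template <typename FFLoad4>
long sum_with_ff(const int *p, std::size_t len, FFLoad4 ff_load4) {
  long sum = 0;
  std::size_t i = 0;
  while (i < len) {
    auto [data, mask] = ff_load4(p + i, len - i);
    std::size_t n = valid_lanes<4>(mask);
    if (n == 0)
      break; // lane 0 is never suppressed; this would be a real fault
    for (std::size_t j = 0; j < n && i + j < len; ++j)
      sum += data[j];
    i += n;
  }
  return sum;
}
```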